Project Data Description

In this project, we analyze 3,202 stock price and volume data time series traded on the NASDAQ exchange between May 30th and October 30th, 2019. This date range was selected for its distance from significant biological and political disruption to the markets, which can both introduce artificial seasonality and increased random variation into forecasts. Data was sourced as comma-separated values through API from AlphaVantage.

Data Selection

Because of the time constraints involved with directly analyzing each stock’s realization, we developed a loop to process through each file, perform a linear model on each stock and where the slope for price is positive with a slope greater than 0.04 and the price is within an affordable range - between 5 and 50 dollars per share - we then select that stock to assess if the spectral density indicates any wandering behavior based on a peak at zero and no additional peaks thereafter. Because of this wandering, the sample ACFs were also expected to damp exponentially, indicative of non-stationary behavior, potentially trending behavior. This method allowed us to identify seven potentially ideal stocks for signal-plus-noise modeling with postive, profitable trending. From these 7 stocks, we selected one and modeled it. We chose this stock to model based on the stationarity of the noise around the signal.

plotts.sample.wge(df$low, trunc=25, arlimits=T)
files = list.files(path='../Time-Series-Stocks', pattern='*.csv')

for(file in files){
  actualFile <- paste0('../Time-Series-Stocks/',file)
  
  df <- read.csv(actualFile, header=T, strip.white=T)

  # run linear regression to get the signal (average).
  t = seq(1, nrow(df),1)
  fit = lm(df$low~t)
  
  # get the frequency values from the spectral density in the Parzen Window (we want them to wander without season; just trend)
  dbz <- plotts.sample.wge(df$low)[4] # plotting sample plots to see the stocks while they process

  # if the linear coefficient (deterministic signal) for the price is positive and the price is between 5 and 50 (affordable):
  if(fit$coefficients[2] > 0.04 && df$low[nrow(df)] > 5 && df$low[nrow(df)] < 50){
    for(i in 1:(length(dbz$dbz)-1)){
      # if the realization is wandering (based on spectral density):
      if(dbz$dbz[i] > dbz$dbz[i+1]){
        write.table(df$symbol, './models_aic_less_than_0.csv', append=T)
      }
    }
  }
}

Candlestick chart for Exploratory Data Analysis (EDA)

Because we have a high, low, open, and close price, we wanted to visually inspect the relationship between these prices across each data point. Through this visual inspection, we noticed differing amounts of variation within each price sets across all data points. As a result, we decided to engineer two new features representing the difference between the high and low and open and close prices. These two new features allowed our multivariate modeling to ingest additional insights into the dynamic relationships betweeen prices in our data.

urlfile = "https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
df <- read_csv(url(urlfile))
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
df <- data.frame(Date=df$times, coredata(df[,2:5]))

fig <- df %>% plot_ly(x = ~Date, type="candlestick",
          open = ~open, close = ~close,
          high = ~high, low = ~low) 
fig <- fig %>% layout(title = "Candlestick Chart for ACGL")

fig

Candlestick Chart for ACGL

After anlayzing the candlestick plot, we decided to use the low price as the target feature of the model. The reason we chose this is because when a stock is trending up, the low price is quick to point this out because the moving average will often rise above the low price, especially for stronger uptrending behavior. This further provides insight into potential investment profitability.

Below are functions for data portability throughout the project. These are the two source data sets we will use.

stock_data <- function(x){
  urlfile="https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
  df <- read_csv(url(urlfile))
  df$volume <- (df$volume/10000)
  HiLo <- df$high - df$low
  HiClo <- df$high - df$close
  OpHi <- df$open - df$high
  OpClo <- df$open - df$close
  OpLo <- df$open - df$low
  CloLo <- df$close - df$low
  varianceRatio <- (df$open - df$close) / (df$high - df$low) * 100
  spread <- df$high - df$low
  df <- data.frame(cbind(df, varianceRatio, HiLo, HiClo, OpHi, OpClo, OpLo, CloLo))
  return(df)
}

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
stock_data_trans <- function(x){
  urlfile="https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
  df <- read_csv(url(urlfile))
  df2 <- df[1:(nrow(df)-1),]
  df2$volume <- (df2$volume/10000)
  lowww <- artrans.wge(df$low, phi.tr=1)
  HiLo <- df2$high - lowww
  HiClo <- df2$high - df2$close
  OpHi <- df2$open - df2$high
  OpClo <- df2$open - df2$close
  OpLo <- df2$open - lowww
  CloLo <- df2$close - lowww
  varianceRatio <- (df2$open - df2$close) / (df2$high - lowww) * 100
  spread <- df2$high - lowww
  df2$low <- lowww
  dfnew <- data.frame(cbind(df2, varianceRatio, HiLo, HiClo, OpHi, OpClo, OpLo, CloLo))
  return(dfnew)
}

dftrans <- stock_data_trans()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

In this Signal-Plus-Noise model, we perform a hypothesis test on the linear trend to identify using OLS if the trend is possibly deterministic. After positive confirmation from OLS, we then tested this with the Cochrane-Orcutt \(AR(1)\) based hypothesis test, which accounts for serial correlation in determining slope significance. This test also confirmed the signal is a deterministic trend.

Next, we removed the residuals from the trend line and built a model for this data. We then tested the residuals for white noise variance using the Ljung-Box test, which indicated there is not enough evidence to consider the residuals to be more than white noise. Because of the stationarity of the residuals, we were able to estimate a model using the linear slope and adding to it the variation of the residuals.

Forecasting error was measured in terms of Average Squared Error using the last trading week’s data points, for which there were five. The ASE was 0.1654078.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# take a sample of the data, analyze
# plotts.sample.wge(df$low, arlimits=T)

####################
# Signal-Plus-Noise
####################
x = df$low
n = length(x)
t = 1:n
fit = lm(x ~ t)
summary(fit) # there appears to be deterministic trend based on OLS; the p-value is significant so reject the null that it is not
## 
## Call:
## lm(formula = x ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10389 -0.52398 -0.02693  0.34676  1.35981 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.855438   0.123444  282.36   <2e-16 ***
## t            0.072922   0.002122   34.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6126 on 98 degrees of freedom
## Multiple R-squared:  0.9234, Adjusted R-squared:  0.9226 
## F-statistic:  1181 on 1 and 98 DF,  p-value: < 2.2e-16
# deterministic. The null argues that any present trend is random that will eventually traverse such a pattern that the realization 
# will continue to approximate around the mean. 

# Because OLS is not robust to non-stationarity, we apply the Cochrane-Orcutt test to also test the beta coefficient slope of time
# using an aproach that fits an AR(1) model to the residuals:
cfit = cochrane.orcutt(fit) # to confirm with chochrane-orcutt
summary(cfit) # Cochran-Orcutt also provides a significant p-value. Based on this, we assume the slope is not equal to zero and
## Call:
## lm(formula = x ~ t)
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 35.0451902  0.3933902  89.085 < 2.2e-16 ***
## t            0.0698107  0.0063528  10.989 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 97 degrees of freedom
## Multiple R-squared:  0.5546 ,  Adjusted R-squared:  0.55
## F-statistic: 120.8 on 1 and 97 DF,  p-value: < 9.97e-19
## 
## Durbin-Watson statistic 
## (original):    0.39519 , p-value: 1.096e-16
## (transformed): 1.49316 , p-value: 3.732e-03
# therefore, there is deterministic that justifies fitting a signal-plus-noise model instead of an ARMA(p,q). However, ARMA(p,q)
# will be fitted later for comparison.

#x.z = x - fit$coefficients[1] - fit$coefficients[2]*t # derive residuals
x.z = fit$residuals # derive residuals (from the regression line)
ar.z = aic.wge(x.z, p=0:6, type="aic") # find a model to use for approximating the residuals. NOTE: (sigmaHAT_a)^2 = 0.1177843
# ar.z$p is the order p (aic selects p=2 where q=0, as does the bic)

# Transform the stock prices by the autoregressive coefficients of the fitted residuals from the linear regression model. 
# Remove phi from residuals to remove serial correlation and allow us to model
y.trans = artrans.wge(df$low, phi.tr=ar.z$phi)
Signal-Plus-Noise Model

Signal-Plus-Noise Model

# also, transform the predictor variable (time) by the autogregressive coefficeints of the fitted residuals as well
t.trans  = artrans.wge(t, phi.tr=ar.z$phi)
Signal-Plus-Noise Model

Signal-Plus-Noise Model

# Finally, regress the newly transformed stock prices (Y-HAT_t) on the transformed time (T-HAT_t)using ordinary least squares
fitco = lm(y.trans ~ t.trans)
summary(fitco) # check to see if the transformed beta coefficient for the slope is still significant
## 
## Call:
## lm(formula = y.trans ~ t.trans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97241 -0.18510  0.01497  0.20566  0.70938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.320641   0.074125  125.74   <2e-16 ***
## t.trans     0.070037   0.004634   15.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3452 on 96 degrees of freedom
## Multiple R-squared:  0.7041, Adjusted R-squared:  0.701 
## F-statistic: 228.4 on 1 and 96 DF,  p-value: < 2.2e-16
# Evaluating the residuals after Cochrane-Orcutt:
plotts.wge(fitco$residuals)
Signal-Plus-Noise Model

Signal-Plus-Noise Model

acf(fitco$residuals) # residuals appear to be white noise
Signal-Plus-Noise Model

Signal-Plus-Noise Model

ljung.wge(fitco$residuals) # there is not enough evidence based on the ljung-box test to reject the null hypothesis. Therefore, 
## Obs -0.02329353 0.03696201 -0.0148023 -0.1264051 0.02979692 0.1170724 0.0121294 -0.04358782 -0.004767952 -0.09739137 0.04153028 -0.05785741 0.01704484 -0.08708388 -0.04190725 -0.00537386 0.008676478 -0.04774231 -0.123505 0.02173201 -0.01302386 -0.09454542 -0.1174163 -0.0818482
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 12.52529
## 
## $df
## [1] 24
## 
## $pval
## [1] 0.9733174
# we cannot assume that the residuals are not white noise.

# Final Signal-Plus-Noise Model: X_t = 34.855438 + 0.072922*t + Z_t, (sigmaHAT_a)^2 = 0.1177843 summary(fit = lm(x ~ t))
# creates the coefficients
ar.z$phi
## [1]  1.0533533 -0.3193699
# (1 - 1.0533533*B + 0.3193699*B^2)*Z_t = a__t. ar.z$phi from AR(2) creates the coeffients and (sigmaHAT_a)^2 = 0.1177843

# BUT, TO REITERATE: Final Signal-Plus-Noise Model is X_t = 34.855438 + 0.072922*t + Z_t, (sigmaHAT_a)^2 = 0.1177843
est_mod = gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843)
Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

plotts.sample.wge(est_mod)
Signal-Plus-Noise Model

Signal-Plus-Noise Model

## $autplt
##  [1] 1.0000000 0.9651366 0.9118145 0.8565512 0.8119889 0.7768475 0.7465706
##  [8] 0.7193186 0.6922062 0.6658128 0.6410263 0.6151013 0.5857824 0.5597746
## [15] 0.5307624 0.5006611 0.4660858 0.4286500 0.3901764 0.3577466 0.3377352
## [22] 0.3232818 0.3066006 0.2867248 0.2640283 0.2377298
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.3656799  10.0443945   4.2373256   0.3328415   3.7183504
##  [6]  -2.9293040  -4.4417351   3.3619398  -4.9979624  -6.0664768
## [11]  -9.1976096  -2.4333288  -4.4143009  -9.4646269  -5.4455831
## [16]  -7.8048186 -13.4622979  -6.1605324  -7.4926226 -13.0960169
## [21] -22.9914819 -16.3550408  -9.4486304 -17.4872277 -18.9786086
## [26] -12.5301829 -16.6987460 -15.5008055 -11.5993071 -17.8186056
## [31] -18.1266390 -19.1975188 -15.2008114 -21.5084173 -18.7413264
## [36] -14.4695154 -16.8957406 -18.8504054 -12.8515599 -20.4378658
## [41] -18.0737498 -19.9406019 -17.1370027 -17.0168315 -16.6613265
## [46] -14.2727801 -17.1060521 -17.8422981 -17.1279923 -26.4868302
## 
## $dbz
##  [1]  10.6211100   9.9096304   8.7223550   7.0687089   4.9951979
##  [6]   2.6416904   0.3110044  -1.5988486  -2.9302225  -3.8882665
## [11]  -4.6975926  -5.4166271  -6.0512655  -6.6709255  -7.3787573
## [16]  -8.2217302  -9.1548511 -10.0826599 -10.9421629 -11.7433498
## [21] -12.5212558 -13.2727162 -13.9604085 -14.5674614 -15.1205846
## [26] -15.6523341 -16.1653392 -16.6486865 -17.1148075 -17.5949536
## [31] -18.0909271 -18.5424199 -18.8588992 -18.9968950 -18.9955537
## [36] -18.9338563 -18.8788838 -18.8729126 -18.9358535 -19.0569523
## [41] -19.1857241 -19.2516426 -19.2186147 -19.1245018 -19.0552735
## [46] -19.0825604 -19.2220228 -19.4272086 -19.6101446 -19.6834596
plotts.sample.wge(df$low) # the estimated model (above) matches to the actual model (here) on both sample ACFs and sample spectral
Signal-Plus-Noise Model

Signal-Plus-Noise Model

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
# density as well as on partial ACF (below):
pacf(est_mod)
pacf(df$low)

#########################################################################################################
############### Confirmation for repeated model visualization of ACF and Spectral Density ###############
#########################################################################################################
for( i in 1: 5)
{
   SpecDen2 = parzen.wge(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
   lines(SpecDen2$freq,SpecDen2$pzgram, lwd = 2, col = "red")
}
Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

for( i in 1: 5)
{
   ACF2 = acf(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
   lines(ACF2$lag ,ACF2$acf, lwd = 2, col = "red")
}
Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

Signal-Plus-Noise Model

#########################################################################################################
#########################################################################################################
#########################################################################################################

signoise.forecast <- fore.sigplusnoise.wge(df$low, max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)
## 
## Coefficients of Original polynomial:  
## 1.0507 -0.3152 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0507B+0.3152B^2    1.6667+-0.6283i      0.5614       0.0574
##   
## 
Signal-Plus-Noise Model

Signal-Plus-Noise Model

SIGNOISE_ASE = mean((df$low[(nrow(df)-5+1):nrow(df)] - signoise.forecast$f)^2)
SIGNOISE_ASE # 0.133713
## [1] 0.133713

ARIMA(p,d,q) Model

In the ARIMA model, we selected a forecast horizon of five trading days because this timeframe completes a full business week. Models can be fully re-developed on non-trading days. However, unless there are two unit roots, ARIMA will not have enough precedential autocorrelation in the realization for a forecasted trend to continue. Consequently, it will have no option but to converge toward the mean. Because of the lack of a seasonal component, may not model well. However, as seen with the Signal-Plus-Noise model, the realization’s strong signal and weak noise components will work well for an Autoregressive/Moving Average model that tends toward the mean since this is what appears to be guiding the primary growth structure of the share price over time.

Because of the strong, positive signal of the price over time, the slowly dampening ACFs, and the strongest root in the series being a \((1-B)\), we differenced the data once to stationarize the realization. Following this, we identified the differenced data to be best represented by modelling a with an \(AR(5)\) component based on Akaike Information Criterion (AIC) scores. There did not appear to be enough noise in the data to need an MA component. A model was estimated to a \(ARMA(0,0)\) - practically, white noise - by the Bayesian Information Criterion score, but because AIC did not suggest this and we identified enough noise to justify modeling, we proceeded with an \(AR(5)\) component for the ARMA model.

The estimated residuals from the estimated \(ARMA(5,0)\) model at \(K=24\) lags produced a p-value\(=0.9405\) that we could not use to reject the null hypothesis, where the null is that the distribution of the residuals of the model are close enough to white noise that we cannot effectively distinguish the difference. Low variance (0.1173) also aided in our conclusion to assume the model represents the data well. We then attempted to identify a new model for the residuals using an AIC score, but the top model provided was a \(ARMA(0,0)\). This concluded our model selection analysis for this model.

Finally, we generated realizations using 99 data points (99 because the difference removed one from the original 100 points) and the estimated model parameters. The spectral density and autocorrelation function plots appeared to match closely to those of the original realization, sugged toward the long-run series mean.

In forecasting with the estimated \(ARIMA(5,1,0)\), the results were well placed with an Average Squared Error (ASE) of 0.2026529. While the model performed well over a 5-point forecast, this model would most likely tend back toward the long-run mean with a larger forecast horizon.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# take a sample of the data, analyze
plotts.sample.wge(df$low, arlimits=T)
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
###########################################################
########################################################### ARMA/ARIMA
###########################################################

#factor.wge(df$low)
est.arma.wge(df$low, p=8, q=3) # the factor table for df$low provides one (1-B) represented as (1-0.9956B), a close-enough 
## 
## Coefficients of Original polynomial:  
## -0.3746 0.7725 0.3401 -0.1343 -0.0358 0.3197 0.1747 -0.0838 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9947B              1.0054               0.9947       0.0000
## 1+1.6828B+0.7314B^2   -1.1504+-0.2091i      0.8552       0.4714
## 1+0.9128B+0.6393B^2   -0.7139+-1.0269i      0.7996       0.3467
## 1-0.9172B+0.5823B^2    0.7875+-1.0474i      0.7631       0.1474
## 1-0.3093B              3.2335               0.3093       0.0000
##   
## 
## $phi
## [1] -0.37456915  0.77252645  0.34011944 -0.13429170 -0.03580418  0.31974554
## [7]  0.17466434 -0.08376272
## 
## $theta
## [1] -1.5943772 -0.8563684 -0.2326950
## 
## $res
##   [1]  0.341382127 -0.339533981  0.372992944  0.045251115  0.282883734
##   [6]  0.543863341  0.027683372 -0.055313770 -0.400705173  0.236964242
##  [11] -0.066224077 -0.109657984  0.373268255 -0.028417015  0.268832218
##  [16]  0.321186364  0.139491809 -0.060923904 -0.207501945 -0.509810842
##  [21]  0.435881683  0.583222371  0.425193624  0.341428403  0.299644074
##  [26]  0.165902559  0.268444218  0.150980721  0.417357079 -0.476663592
##  [31]  0.367971194 -0.066208814  0.238830218 -0.357908092 -0.154882518
##  [36] -0.260220286 -0.126198191 -0.314164842  0.194647972  0.075843457
##  [41]  0.065674888  0.249249868  0.326605797  0.129696758  0.072656694
##  [46] -0.081003028 -0.590131520  0.281198743  0.348383312  0.735499563
##  [51] -0.244213897 -0.177292255  0.295402720 -0.397741559  0.068422622
##  [56]  0.468362947  0.658157668 -0.226977155 -0.082270354  0.152294282
##  [61] -0.788235363  0.327803117 -0.054959335 -0.071767723  0.259976113
##  [66]  0.155307265 -0.019760354  0.873508519  0.739690436  0.089012192
##  [71] -0.039686548 -0.439091020  0.389397691 -0.057910431 -0.435274770
##  [76]  0.127990892  0.132624679  0.123202670  0.370572426 -0.291439393
##  [81]  0.414131539  0.079728839  0.261378249  0.571631165 -0.110336678
##  [86]  0.186149091 -0.454963807 -0.443809259 -0.007272801  0.584743178
##  [91]  0.077730704 -1.018479869  0.311826735  0.461004571  0.090793997
##  [96] -0.174995085  0.530696568 -0.456227949  0.815178176 -0.275805473
## 
## $avar
## [1] 0.1295548
## 
## $aic
## [1] -1.803651
## 
## $aicc
## [1] -0.7413255
## 
## $bic
## [1] -1.491031
## 
## $se.phi
## [1] 0.56643405 0.22018656 0.51330869 0.16824213 0.06855007 0.02066278
## [7] 0.04665927 0.02784275
## 
## $se.theta
## [1] 0.5715739 1.2523688 0.2789225
# approximation. Therefore, we difference the data once. Preliminary evidence suggesting differencing is useful is based 
# on the specified wandering and the damping sample autocorrelations. When using an overfit table, there was a factor close
# to (1-b)^2, but it was not very significant (third most significant using estimated_arma <- est.arma.wge(dftrans, p=15, q=1))
# so we decided to only use a first difference.

dftrans <- artrans.wge(df$low, phi.tr=1)
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

pacf(dftrans)
aic5.wge(dftrans, type="aic") # AIC = -2.001944. p=5, q=0
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1  -2.001944
## 13    4    0  -1.985397
## 14    4    1  -1.966194
## 16    5    0  -1.966140
## 10    3    0  -1.934893
estimated_arma <- est.arma.wge(dftrans, p=5, q=0)
## 
## Coefficients of Original polynomial:  
## 0.1227 -0.1183 -0.1454 -0.2587 -0.0292 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0644B+0.6446B^2    0.8257+-0.9325i      0.8028       0.1347
## 1+0.8220B+0.3777B^2   -1.0882+-1.2098i      0.6146       0.3666
## 1+0.1198B             -8.3479               0.1198       0.5000
##   
## 
estimated_arma$phi
## [1]  0.12269667 -0.11827203 -0.14543411 -0.25874556 -0.02916165
#estimated_arma <- est.arma.wge(dftrans, p=15, q=1)
ljung.wge(estimated_arma$res, p=5, q=0) # suggests there is no serial correlation in the residuals of the model; this is a good fit.
## Obs -0.001755173 -0.0006645789 -0.03189077 -0.0558602 -0.03745613 -0.006650248 -0.07239689 -0.1447362 -0.01711384 -0.08391131 0.05274902 -0.02483137 0.045087 -0.06707539 -0.004246384 0.07628534 0.08560867 -0.0009574519 -0.1290798 0.02213024 0.01599169 -0.07501175 -0.1093225 -0.02137378
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 11.57463
## 
## $df
## [1] 19
## 
## $pval
## [1] 0.9029941
estimated_arma$avar # 0.1240151
## [1] 0.1240151
mean(df$low) # 38.53802
## [1] 38.53802
# (1-B)(1-0.12269667B+0.11827203B^2+0.14543411B^3+0.25874556B^4-0.02916165B^5)*(X_t + 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151
# (1-1.12269667B+0.00442464B^2+0.26370614B^3+0.11331145B^4-0.28790721B^5+0.02916165B^6)*(X_t + 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151

######
###### Plotting residuals

# the residuals of the model do not appear correlated. The sample autocorrelation and partial autocorrelation plots indicate 
# stationarity across all lags with all lags measured within the 5% limits
plotts.sample.wge(estimated_arma$res, arlimits=T)
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

## $autplt
##  [1]  1.0000000000 -0.0017551730 -0.0006645789 -0.0318907670 -0.0558601955
##  [6] -0.0374561274 -0.0066502482 -0.0723968930 -0.1447362269 -0.0171138352
## [11] -0.0839113138  0.0527490212 -0.0248313665  0.0450869990 -0.0670753909
## [16] -0.0042463842  0.0762853413  0.0856086721 -0.0009574519 -0.1290798025
## [21]  0.0221302437  0.0159916891 -0.0750117483 -0.1093225110 -0.0213737804
## [26]  0.0330170662
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1]  -8.833455090   0.263787556 -11.498115493   2.117874832   2.272952545
##  [6]  -9.309718287   5.018377394   0.289213348  -1.892315270 -11.308615271
## [11]   3.725194391  -0.611197010 -12.632295451  -0.590314051   3.046760336
## [16] -12.371079560   0.004883422   2.510605713   4.337557356  -3.178637232
## [21] -17.271334056  -1.077447786  -2.257551049   3.231804970  -1.965461358
## [26]  -3.043185676  -1.825761971  -7.420727112   2.312103352  -1.619957827
## [31]  -7.841066943   5.001502888  -2.103213653   0.521620232   2.860045258
## [36]  -2.911236603  -9.089689687   4.807535654  -1.618776046 -13.746143879
## [41]  -6.930503276   3.442666207  -8.907109791  -0.989371943   3.916436743
## [46] -14.309350664   0.883844166   0.342804251  -6.422777409
## 
## $dbz
##  [1] -2.1421139616 -1.5268677158 -0.7764179465 -0.0840477881  0.4450188370
##  [6]  0.7709948539  0.8934228747  0.8371033041  0.6469486934  0.3863575660
## [11]  0.1329634098 -0.0354873769 -0.0674530562  0.0372285907  0.2260959001
## [16]  0.4205848653  0.5490294530  0.5673324474  0.4653779820  0.2642966345
## [21]  0.0072262109 -0.2555822359 -0.4825978402 -0.6484742078 -0.7390119560
## [26] -0.7427274437 -0.6522279590 -0.4757583702 -0.2446363728 -0.0055391248
## [31]  0.1965782737  0.3316246748  0.3873844148  0.3649753175  0.2727409570
## [36]  0.1240460268 -0.0586583169 -0.2366934318 -0.3546826477 -0.3582603902
## [41] -0.2277399300 -0.0007442644  0.2413219942  0.4108729200  0.4464568416
## [46]  0.3298807324  0.0947313908 -0.1721868052 -0.3517455314
par(mfrow=c(3,1))
plot(estimated_arma$res, ylab="ARIMA Residuals", xlab = "Trading Days", main = "ARIMA(5,1,1) Model Residuals over Time", lwd=2, type="l")
abline(h=mean(estimated_arma$res), col="blue")
acf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits
pacf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

aic5.wge(estimted_arma$res, type="aic") # White noise is the top model selected for the residuals, by AIC (ARMA(0,0))
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Error in aic calculation at 0 0 
## Error in aic calculation at 0 1 
## Error in aic calculation at 0 2 
## Error in aic calculation at 1 0 
## Error in aic calculation at 1 1 
## Error in aic calculation at 1 2 
## Error in aic calculation at 2 0 
## Error in aic calculation at 2 1 
## Error in aic calculation at 2 2 
## Error in aic calculation at 3 0 
## Error in aic calculation at 3 1 
## Error in aic calculation at 3 2 
## Error in aic calculation at 4 0 
## Error in aic calculation at 4 1 
## Error in aic calculation at 4 2 
## Error in aic calculation at 5 0 
## Error in aic calculation at 5 1 
## Error in aic calculation at 5 2 
## Five Smallest Values of  aic
##      p    q        aic
## 1    0    0     999999
## 2    0    1     999999
## 3    0    2     999999
## 4    1    0     999999
## 5    1    1     999999
######
######

######
###### Compare estimated model to differenced data

# We generated a model of 99 data points using the ARMA(5,1) identified by aic5 with a variance equal to that estimated from the
# differenced data, which is 0.1172604 (select to see origin highlighted above)
est_mod <- gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar, sn=40)
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

plotts.sample.wge(est_mod, arlimits=T) # estimated model
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

## $autplt
##  [1] 1.0000000 0.9319568 0.8616508 0.8115232 0.7837261 0.7798462 0.7686385
##  [8] 0.7540206 0.7238039 0.6950334 0.6636351 0.6243891 0.5993385 0.5769297
## [15] 0.5519124 0.5366042 0.5296241 0.5194643 0.4818245 0.4413453 0.3916907
## [22] 0.3539872 0.3181363 0.2818305 0.2584268 0.2245354
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1]  15.4662467   6.4612009   2.8156959  -6.5831271  -0.3491412
##  [6]   0.6467012 -23.4462233 -11.9860652  -4.2541616  -2.3339334
## [11]  -4.9973000  -0.5165929  -8.4672734  -5.9260513 -12.5070338
## [16]  -1.5329287  -9.4007029  -4.2193607 -12.8919484 -17.0150048
## [21]  -6.9836378 -14.0339519  -5.8623391 -15.3945590 -11.2233206
## [26] -24.0796617 -10.8749229 -18.3577480 -11.7127985  -7.5363064
## [31] -14.6647063 -14.4603439 -11.9075717 -18.9642429 -19.7115192
## [36] -13.9330289 -14.7859997 -12.3169328 -15.2897638 -15.2611997
## [41]  -7.5246856 -25.4508518 -13.1571835 -18.5429171 -14.5183414
## [46] -11.9244393 -15.4696702 -20.2139032 -16.6394603
## 
## $dbz
##  [1]  10.3894114   9.7014613   8.5461122   6.9166900   4.8252742
##  [6]   2.3519955  -0.2434927  -2.4200960  -3.6721480  -4.1238920
## [11]  -4.2332485  -4.2725076  -4.3507834  -4.5391300  -4.8913942
## [16]  -5.4201345  -6.0871771  -6.8234956  -7.5683596  -8.2978154
## [21]  -9.0157797  -9.7227007 -10.4014424 -11.0315750 -11.6029838
## [26] -12.1026903 -12.4960138 -12.7466330 -12.8695657 -12.9469286
## [31] -13.0762896 -13.3067957 -13.6115217 -13.9016587 -14.0770531
## [36] -14.0895034 -13.9687564 -13.7936668 -13.6492815 -13.6034164
## [41] -13.6994736 -13.9536233 -14.3531696 -14.8590436 -15.4153465
## [46] -15.9631305 -16.4485233 -16.8195565 -17.0238291
plotts.sample.wge(df$low, arlimits=T) # estimated model
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
#################################################################################################################################################
############### Confirmation that repeated estimated model visualizations of ACF and Spectral Density match that of original data ###############
#################################################################################################################################################

for(i in 1:2)
{
   SpecDen2 = parzen.wge(gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar), plot = T)
   lines(SpecDen2$freq,SpecDen2$pzgram, lwd = 2, col = "red")
}
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

for(i in 1:2)
{
   ACF2 = acf(gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar), plot = T)
   lines(ACF2$lag ,ACF2$acf, lwd = 2, col = "red")
}
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

#########################################################################################################
#########################################################################################################
#########################################################################################################

# after comparing sample spectral density and ACF plots, we comapred the AIC-estimated models to compare identified models
aic5.wge(dftrans)
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1  -2.001944
## 13    4    0  -1.985397
## 14    4    1  -1.966194
## 16    5    0  -1.966140
## 10    3    0  -1.934893
aic5.wge(est_mod2) # ARMA(5,0) and ARMA(5,1) are top in both models, indicating a potentially useful model..
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Error in aic calculation at 0 0 
## Error in aic calculation at 0 1 
## Error in aic calculation at 0 2 
## Error in aic calculation at 1 0 
## Error in aic calculation at 1 1 
## Error in aic calculation at 1 2 
## Error in aic calculation at 2 0 
## Error in aic calculation at 2 1 
## Error in aic calculation at 2 2 
## Error in aic calculation at 3 0 
## Error in aic calculation at 3 1 
## Error in aic calculation at 3 2 
## Error in aic calculation at 4 0 
## Error in aic calculation at 4 1 
## Error in aic calculation at 4 2 
## Error in aic calculation at 5 0 
## Error in aic calculation at 5 1 
## Error in aic calculation at 5 2 
## Five Smallest Values of  aic
##      p    q        aic
## 1    0    0     999999
## 2    0    1     999999
## 3    0    2     999999
## 4    1    0     999999
## 5    1    1     999999
######
######

######
###### Forecast and Measure Average Squared Errors (ASE)

# Model Forecast and ASE
# Forecasting the differenced data using the parameters estimated from it (we do not apply a d=1)
nonseasonal.forecast <- fore.aruma.wge(df$low, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1,n.ahead=5, lastn=T)
ARIMA(p,d,q) Model

ARIMA(p,d,q) Model

ARIMA_ASE = mean((df$low[(nrow(df)-5+1):nrow(df)] - nonseasonal.forecast$f)^2)
ARIMA_ASE# 0.2026529
## [1] 0.2026529
######
######

Multivariate Regression Time Series Modeling

Following the univariate modeling of this time series, we decided to enhance our model with multivariate factors, both additive and multiplicative. As previously mentioned, based on the candlestick plot we created for exploratory data analysis, we were interested in exploring the relationship of the ranges of open to close and high to low prices as related to the low price. The features we engineered for this aided in describing the data and enhancing the models.

The multivariate models applied include Vector Autoregressive and Neural Network with Multi-Layer Perceptron (MLP) models, in addition to an ensemble model combining the forecasts of the univariate Signal-Plus-Noise and the multivariate MLP model. The ASE scores for all multivariate models outperformed all univariate models. However, an ensemble of Signal-Plus-Noise and the MLP model provides a safeguard of overfitting from a multivariate model given the dependency on additional variables that can change significantly very quickly, disrupting a complex forecast.

Vector Autoregressive Modeling

Vector Autoregressive Modeling without Trend

The vector Autoregressive (VAR) models used take all 75 points of the training data - including for the exogenous variables - to bulid a model. We forecasted 5 data points ahead from this trainin set of 95 out of 100 data points and then generated the overall ASE and the 5-point rolling window ASE for model comparison.

The autoregressive model identifed is an AR(2) with an AIC score of -2.4212.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)

## $autplt
##  [1]  1.000000000  0.332654213 -0.015323504 -0.185197647  0.005818653
##  [6]  0.124607489  0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175  0.213392652  0.211975490  0.061996282 -0.059048222
## [26] -0.052510741
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  -7.38499578  -0.95161030   3.39895369   3.13204413   7.25816141
##  [6]  -3.34022975   1.45696680  -3.16420812   0.98653170  -0.01510512
## [11]   0.74899492  -3.54602921  -7.04898470   6.69701287   0.21917900
## [16]   3.26040620  -2.30071218   5.83627330   0.12105751   4.40156709
## [21]   1.31507769   1.49404141   4.21104386  -7.07865344 -14.93552255
## [26]  -1.90529423  -9.05704419   1.17291691  -2.44739663 -15.18817408
## [31]  -3.82537165  -8.97778373  -1.31847009  -8.52554281  -9.96514339
## [36]  -0.84871304  -2.04248679  -9.19693386  -5.37219836  -6.41835850
## [41]  -0.84106671 -15.69187726   0.78348200  -7.78153910 -19.81844091
## [46]  -1.92362217  -0.03109830  -1.67305303  -6.34494742  -2.36851032
## 
## $dbz
##  [1]  0.7617948  1.3396040  1.9342166  2.3252670  2.4227279  2.2272880
##  [7]  1.8074853  1.2989121  0.8916181  0.7572102  0.9377718  1.3279105
## [13]  1.7766211  2.1767262  2.4818412  2.6857904  2.7968223  2.8178953
## [19]  2.7371907  2.5301127  2.1698520  1.6414415  0.9557012  0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

df2 <- df[1:(nrow(df)-5),]

VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC picked p=1 with AIC -2.3235
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      5      2      2      5 
## 
## $criteria
##                  1           2           3           4          5
## AIC(n) -2.32345681 -2.42117979 -2.41376593 -2.40362716 -2.4365656
## HQ(n)  -2.26710299 -2.35355521 -2.33487058 -2.31346105 -2.3351287
## SC(n)  -2.18364578 -2.25340655 -2.21803048 -2.17992951 -2.1849057
## FPE(n)  0.09794606  0.08883496  0.08950683  0.09043349  0.0875214
##                  6
## AIC(n) -2.42032626
## HQ(n)  -2.30761863
## SC(n)  -2.14070419
## FPE(n)  0.08897736
lsfit <- VAR(y=cbind(df2$low, df2$volume, df2$HiLo, df2$OpClo), p=2, type="const") # with p=5, this will estimate the coefficients for lag 5

preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.2282559
## [1] 0.2282559
plot(seq(1,100,1), df$low[1:100], type = "l", xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling without Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend

In the VAR model with trend, we introduce a time component into the model as an additive predictor term. The model identifed is an AR(2) with an AIC score of -2.4985.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)

## $autplt
##  [1]  1.000000000  0.332654213 -0.015323504 -0.185197647  0.005818653
##  [6]  0.124607489  0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175  0.213392652  0.211975490  0.061996282 -0.059048222
## [26] -0.052510741
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  -7.38499578  -0.95161030   3.39895369   3.13204413   7.25816141
##  [6]  -3.34022975   1.45696680  -3.16420812   0.98653170  -0.01510512
## [11]   0.74899492  -3.54602921  -7.04898470   6.69701287   0.21917900
## [16]   3.26040620  -2.30071218   5.83627330   0.12105751   4.40156709
## [21]   1.31507769   1.49404141   4.21104386  -7.07865344 -14.93552255
## [26]  -1.90529423  -9.05704419   1.17291691  -2.44739663 -15.18817408
## [31]  -3.82537165  -8.97778373  -1.31847009  -8.52554281  -9.96514339
## [36]  -0.84871304  -2.04248679  -9.19693386  -5.37219836  -6.41835850
## [41]  -0.84106671 -15.69187726   0.78348200  -7.78153910 -19.81844091
## [46]  -1.92362217  -0.03109830  -1.67305303  -6.34494742  -2.36851032
## 
## $dbz
##  [1]  0.7617948  1.3396040  1.9342166  2.3252670  2.4227279  2.2272880
##  [7]  1.8074853  1.2989121  0.8916181  0.7572102  0.9377718  1.3279105
## [13]  1.7766211  2.1767262  2.4818412  2.6857904  2.7968223  2.8178953
## [19]  2.7371907  2.5301127  2.1698520  1.6414415  0.9557012  0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

df2 <- df[1:(nrow(df)-5),] 
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC:  p=1 with AIC -2.3158
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                 1           2           3           4           5
## AIC(n) -2.3288269 -2.49848113 -2.47601445 -2.45354442 -2.45473227
## HQ(n)  -2.2612024 -2.41958579 -2.38584835 -2.35210755 -2.34202465
## SC(n)  -2.1610537 -2.30274568 -2.25231680 -2.20188456 -2.17511021
## FPE(n)  0.0974299  0.08223654  0.08411857  0.08604793  0.08596807
##                  6
## AIC(n) -2.45522002
## HQ(n)  -2.33124163
## SC(n)  -2.14763575
## FPE(n)  0.08595343
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1132188
## [1] 0.1132188
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$df2.low.i..i....trainingSize...1...[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend, Trend and Volume Interaction

Our VAR model with additive terms for trend, volume, the spread between high and low and open and close prices in addition to a multiplicative term for trend and its impact on volume outperformed all other VAR models we built in terms of ASE. Adding the multiplicative term in this model reduced the ASE to roughly half, from 0.2139 to 0.1296. The Autoregressive lag order identified for the model is an AR(2) with an AIC score of -2.4780.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume

ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation

ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
#VARselect(df2$low, lag.max=6, type="const", season=NULL, exogen=data.frame(t[96:100], df2$volume, df2$HiLo, df2$OpClo, t[96:100]*df2$volume)) # AIC:  p=1 with AIC -2.3158

# trend added manually:
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), type="const") # AIC:  p=1 with AIC -2.3081
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                  1           2         3           4           5
## AIC(n) -2.30801515 -2.47800080 -2.455533 -2.43306769 -2.43298718
## HQ(n)  -2.22911981 -2.38783470 -2.354096 -2.32036006 -2.30900879
## SC(n)  -2.11227971 -2.25430315 -2.203873 -2.15344563 -2.12540291
## FPE(n)  0.09949085  0.08395165  0.085877  0.08785085  0.08788582
##                  6
## AIC(n) -2.43521003
## HQ(n)  -2.29996088
## SC(n)  -2.09966355
## FPE(n)  0.08772417
# trend added manually:
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1296242
## [1] 0.1296242
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$df2.low.i..i....trainingSize...1...[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend, Time and Volume Interaction, Time and OpClo Interaction

Finally, we modeled with an interaction between trend and volume and trend and the range between open and close prices (we also attempted for interaction between trend and the range between high and low prices, but this hurt the model, potentially indicating there is greater variation in between the high and low prices than the open and close prices). Nonetheless, the ASE resulting from the forecast was higher than the multiplicative model only including an interaction between trend and trade volume. Therefore, the latter model is the one we selected as the top VAR candidate. As with the other VAR models, this model applied an identified second-order autoregressive with an AIC score of -2.4645.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
plotts.sample.wge(df$volume, arlimits=T)

## $autplt
##  [1]  1.000000000  0.332654213 -0.015323504 -0.185197647  0.005818653
##  [6]  0.124607489  0.059217024 -0.058842730 -0.135327419 -0.128903104
## [11] -0.111554245 -0.136230829 -0.177090893 -0.102374463 -0.010295688
## [16] -0.012154962 -0.091622722 -0.032089912 -0.074665693 -0.032921963
## [21] -0.004581175  0.213392652  0.211975490  0.061996282 -0.059048222
## [26] -0.052510741
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  -7.38499578  -0.95161030   3.39895369   3.13204413   7.25816141
##  [6]  -3.34022975   1.45696680  -3.16420812   0.98653170  -0.01510512
## [11]   0.74899492  -3.54602921  -7.04898470   6.69701287   0.21917900
## [16]   3.26040620  -2.30071218   5.83627330   0.12105751   4.40156709
## [21]   1.31507769   1.49404141   4.21104386  -7.07865344 -14.93552255
## [26]  -1.90529423  -9.05704419   1.17291691  -2.44739663 -15.18817408
## [31]  -3.82537165  -8.97778373  -1.31847009  -8.52554281  -9.96514339
## [36]  -0.84871304  -2.04248679  -9.19693386  -5.37219836  -6.41835850
## [41]  -0.84106671 -15.69187726   0.78348200  -7.78153910 -19.81844091
## [46]  -1.92362217  -0.03109830  -1.67305303  -6.34494742  -2.36851032
## 
## $dbz
##  [1]  0.7617948  1.3396040  1.9342166  2.3252670  2.4227279  2.2272880
##  [7]  1.8074853  1.2989121  0.8916181  0.7572102  0.9377718  1.3279105
## [13]  1.7766211  2.1767262  2.4818412  2.6857904  2.7968223  2.8178953
## [19]  2.7371907  2.5301127  2.1698520  1.6414415  0.9557012  0.1601924
## [25] -0.6606149 -1.4101839 -2.0301647 -2.5352613 -2.9884786 -3.4384161
## [31] -3.8748905 -4.2313776 -4.4336135 -4.4602264 -4.3559544 -4.1882068
## [37] -4.0026903 -3.8148228 -3.6258049 -3.4384625 -3.2600843 -3.0965714
## [43] -2.9492483 -2.8191207 -2.7129243 -2.6423490 -2.6148787 -2.6237376
## [49] -2.6467693 -2.6582701
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.

# find a good count for p for the AR(p) component of the vector autoregressive model. No indication above for lag was required, but :
VARselect(df2$low, lag.max=8, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), type="const") # selects p=1 as the best value for AR(p)
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                 1           2           3           4           5
## AIC(n) -2.2713589 -2.46459045 -2.44187917 -2.41889398 -2.41500677
## HQ(n)  -2.1800536 -2.36187199 -2.32774754 -2.29334919 -2.27804882
## SC(n)  -2.0446087 -2.20949651 -2.15844146 -2.10711249 -2.07488151
## FPE(n)  0.1032257  0.08510686  0.08708604  0.08914122  0.08952502
##                  6           7           8
## AIC(n) -2.41422092 -2.40044609 -2.37762703
## HQ(n)  -2.26584981 -2.24066181 -2.20642959
## SC(n)  -2.04575189 -2.00363329 -1.95247046
## FPE(n)  0.08963886  0.09093375  0.09309378
# Build the model using the p identified for AR(p) and the training data. type="const" because trend is included
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1

preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1703138
## [1] 0.1703138
plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$df2.low.i..i....trainingSize...1...[,1], type = "l", col = "red", lwd=2)

Neural Network Models: Multi-Layered Perceptrons

A multi-layered perceptron operates using a “feedforward” artificial neural network to pass forward information from the input nodes into a hidden layer of nodes that perform logical activation functions before processing into the output node. In the case of this analysis, lags were detected and applied on input to both the predictor regressors as well as the univariate response.

When adding additional predictor variables to the multivariate models, the forecasts converge less toward the long-run mean and more toward the behavior of the realization itself. With this said, the addition of further predictor variables - such as those through natural language processing of market sentiment analysis - can aid in further stabilizing useful multivariate models.

Neural Network without Trend

The Neural Network without trend model we produced generated the highest Average Squared Error of all our MLP models. However, the ASE was still highly competitive with all other models we produced. The ASE for this model is 0.1758194.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume

ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation

ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 6 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 2 regressors included.
## - Regressor 1 lags: (1,2)
## - Regressor 2 lags: (1,2,4)
## Forecast combined using the mean operator.
## MSE: 0.0382.
plot(fit.mlp)

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.1758194
## [1] 0.1835049

Neural Network with Trend

After adding trend into our Neural Network modeling, we successfully lowered the ASE to 0.06421939.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
t <- 1:nrow(df)
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume

ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation

ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 1 hidden node and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.083.
plot(fit.mlp)

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.06421939
## [1] 0.06459411

Neural Network with Trend, Time and Volume Interaction

When we added a multiplicative term for trend and volume, our ASE lowered from 0.0642 to 0.0549. Although the Neural Nets that included a trend predictor were very close in terms of residual error, this model produced the lowest ASE.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
t <- 1:nrow(df)
# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume

ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation

ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 5 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0599.
plot(fit.mlp)

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull))
plot(fore.mlp)

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05486796
## [1] 0.06564621

Neural Network with Trend, Time and Volume Interaction, Time and OpClo Interaction

It is important to note that we did not identify an interaction between time and the spread between high and low prices to benefit the model, but we did find that interaction between time and the spread between open and close prices to be beneficial for reducing ASE. However, the overall reduction from this model increased the ASE from 0.0549 in the model using an interaction between only trend and volume to an ASE of 0.0571.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
t <- 1:nrow(df)

# First Variable has the "time shift"
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume

ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation

ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
High <- ts(df$high)
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv", difforder=NULL)
fit.mlp
## MLP fit with 7 hidden nodes and 15 repetitions.
## Univariate lags: (1)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0654.
plot(fit.mlp)

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull))
plot(fore.mlp)

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05711815
## [1] 0.06303533

Rolling ASEs

To provide additional analysis for the best models of each category, we have provided below the Average Squared Error from a rolling 5-point forecast window.

Rolling ASE for Neural Networks: Multi-Layered Perceptrons

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
#MLP
ts = df$low
batch_size = 95
start = 1
num_batches = length(ts)-batch_size+1
NN_ASEs = numeric(num_batches)
dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

#Rolling ASE
for (i in 0:(num_batches-1))
{
  fit.mlp <- mlp(ts(ts[i:(i+(batch_size-1))]), xreg=data.frame(volumeFull[i:(i+(batch_size-1))], HiLoFull[i:(i+(batch_size-1))], OpCloFull[i:(i+(batch_size-1))]), reps = 15, comb = "mean", hd.auto.type="cv")
  fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
  NN_ASEs[i+1] = mean((ts[(length(ts) - 4):length(ts)] - fore.mlp$mean)^2)
  start = start+1
}

NN_ASEs
## [1] 0.1852031 0.1678351 0.1800357 0.1032628 0.1136498 0.0870527
mean(NN_ASEs) # 0.1468514
## [1] 0.1395065

Rolling ASE for Vector Autoregressive

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
df2 <- df[1:(nrow(df)-5),]
t = 1:100
trainingSize = 91
horizon = 5
ASEHolder_2 = numeric()

for( i in 1:5)
{
  
  VARselect(df2$low[i:(i + (trainingSize - 1))], lag.max=6, season=NULL, exogen=data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))])), type="const") # AIC picked p=2 with AIC -2.44854735
  lsfit <- VAR(y=cbind(df2$low[i:(i + (trainingSize - 1))], exogen = data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))]))), p=1, type="const")
  preds <- predict(lsfit, n.ahead=5)

  ASE_2 = mean((df$low[(trainingSize + i):(trainingSize + i + (horizon) - 1)] - preds$fcst$df2.low.i..i....trainingSize...1...[,1])^2)
  
  ASEHolder_2[i] = ASE_2
  
}

rolling_ASE = mean(ASEHolder_2)
rolling_ASE # 0.1869798
## [1] 0.1869798

Rolling ASE for Signal-Plus-Noise

#Signal + Noise
df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
df2 <- df[1:(nrow(df)-5),]
ts = df$low
t2 = 1:95
batch_size = 96
start = 1
t = 100
num_batches = length(ts)-batch_size+1
sigplus_ASEs = numeric(num_batches)

for (i in 0:(num_batches-1))
{
  
  signoise.forecast <- fore.sigplusnoise.wge(df$low[i:(i+(batch_size-1))], max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)

  sigplus_ASEs[i+1] = mean((df$low[(nrow(df)-4):nrow(df)] - signoise.forecast$f)^2)
}
## 
## Coefficients of Original polynomial:  
## 1.1090 -0.3881 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1090B+0.3881B^2    1.4288+-0.7317i      0.6230       0.0753
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.1077 -0.3812 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1077B+0.3812B^2    1.4531+-0.7156i      0.6174       0.0728
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.1170 -0.3918 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1170B+0.3918B^2    1.4254+-0.7213i      0.6260       0.0746
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.0446 -0.3237 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0446B+0.3237B^2    1.6135+-0.6971i      0.5689       0.0649
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.0617 -0.3442 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0617B+0.3442B^2    1.5423+-0.7257i      0.5867       0.0700
##   
## 

sigplus_ASEs
## [1] 0.13086079 0.15037944 0.20082465 0.08539453 0.09946937
rolling_ASE_sigplus = mean(sigplus_ASEs)
rolling_ASE_sigplus # 0.1333858
## [1] 0.1333858

Rolling ASE for ARIMA(5,1,0)

### Rolling ASE for Neural Networks: Multi-Layered Perceptrons
#ARIMA
df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
ts = df$low
batch_size = 95
start = 1
num_batches = length(ts)-batch_size+1
ASEs = numeric(num_batches)

for (i in 1:5)
{
  forecasts = fore.aruma.wge(ts[start:(batch_size+i)], phi = c(0.1227,-.1183,.1454,-.2587,.029), d = 1, n.ahead = 5, lastn = TRUE, limits = FALSE)
  ASEs[i+1] = mean((ts[(length(ts) - 4):length(ts)] - forecasts$f)^2)
}

rolling_ASE = mean(ASEs) 

rolling_ASE #0.3935643
## [1] 0.3935643

Rolling ASE for Ensemble Model: Signal-Plus-Noise and Multi-Layered Perceptrons

#Ensemble
mlp_pred = as.numeric(fore.mlp$mean)

signoise = as.numeric(signoise.forecast$f)

en_pred = (mlp_pred + signoise)/2

ASE = mean((ts[(length(ts) - 4):length(ts)] - en_pred)^2)
ASE # 0.06966031
## [1] 0.06732777
rolling_ASEs_ensemble = (sigplus_ASEs + NN_ASEs)/2
rolling_ASEs_ensemble
## [1] 0.15803194 0.15910728 0.19043018 0.09432865 0.10655961 0.10895674
rolling_ASE_mean_ensemble = mean((sigplus_ASEs + NN_ASEs)/2)
rolling_ASE_mean_ensemble
## [1] 0.1362357